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Abstract 

Most  traditional  information  extraction 
approaches  are  generative  models  that  assume 
events  exist  in  text  in  certain  patterns  and  these 
patterns  can  be  regenerated  in  various  ways. 
These  assumptions  limited  the  syntactic  clues 
being  considered  for  finding  an  event  and 
confined  these  approaches  to  a  particular 
syntactic  level.  This  paper  presents  a 
discriminative  framework  based  on  kernel  S  VMs 
that  takes  into  account  different  levels  of 
syntactic  information  and  automatically 
identifies  the  appropriate  clues.  Kernels  are  used 
to  represent  certain  levels  of  syntactic  structure 
and  can  be  combined  in  principled  ways  as  input 
for  an  SVM.  We  will  show  that  by  combining  a 
low  level  sequence  kernel  with  a  high  level 
kernel  on  a  GLARF  dependency  graph,  the  new 
approach  outperformed  a  good  rule-based 
system  on  slot  filler  detection  for  MUC-6. 

1  Introduction 

The  goal  of  Information  Extraction  (IE)  is  to 
extract  structured  facts  of  interest  from  text  and 
present  them  in  databases  or  templates.  Much  of 
the  IE  research  was  promoted  by  the  US 
Government-sponsored  MUCs  (Message 
Understanding  Conferences).  The  techniques  used 
by  Information  Extraction  depend  greatly  on  the 
sublanguage  used  in  a  domain,  such  as  financial 
news  or  medical  records.  The  training  data  for  an 
IE  system  is  often  sparse  since  the  target  domain 
changes  quickly.  Traditional  IE  approaches  try  to 
generate  patterns  for  events  by  various  means 
using  training  data.  Eor  example,  the  EASTUS 
(Appelt  et  al.,  1996)  and  Proteus  (Grishman,  1996) 
systems,  which  performed  well  for  MUC-6,  used 
hand- written  rules  for  event  patterns.  The  symbolic 
learning  systems,  like  AutoSlog  (Riloff,  1993)  and 
CRYSTAL  (Eisher  et  al.,  1996),  generated  patterns 
automatically  from  specific  examples  (text 
segments)  using  generalization  and  predefined 
pattern  templates.  There  are  also  statistical 
approaches  (Miller  et  al.,  1998)  (Collins  et  al., 
1998)  trying  to  encode  event  patterns  in  statistical 
CEG  grammars.  All  of  these  approaches  assume 


events  occur  in  text  in  certain  patterns.  However 
this  assumption  may  not  be  completely  correct  and 
it  limits  the  syntactic  information  considered  by 
these  approaches  for  finding  events,  such  as 
information  on  global  features  from  levels  other 
than  deep  processing.  This  paper  will  show  that  a 
simple  bag-of-words  model  can  give  us  reliable 
information  about  event  occurrence.  When  training 
data  is  limited,  these  other  approaches  may  also  be 
less  effective  in  their  ability  to  generate  reliable 
patterns. 

The  idea  for  overcoming  these  problems  is  to 
avoid  making  any  prior  assumption  about  the 
syntactic  structure  an  event  may  assume;  instead, 
we  should  consider  all  syntactic  features  in  the 
target  text  and  use  a  discriminative  classifier  to 
decide  that  automatically.  Discriminative 
classifiers  make  no  attempt  to  resolve  the  structure 
of  the  target  classes.  They  only  care  about  the 
decision  boundary  to  separate  the  classes.  In  our 
case,  we  only  need  criteria  to  predict  event 
elements  from  text  using  the  syntactic  features 
provided.  This  seems  a  more  suitable  solution  for 
IE  where  training  data  is  often  sparse. 

This  paper  presents  an  approach  that  uses  kernel 
functions  to  represent  different  levels  of  syntactic 
structure  (information).  With  the  properties  of 
kernel  functions,  individual  kernels  can  be 
combined  freely  into  comprehensive  kernels  that 
cross  syntactic  levels.  The  classifier  we  chose  to 
use  is  SVM  (Support  Vector  Machine),  mostly  due 
to  its  ability  to  work  in  high  dimensional  feature 
spaces.  The  experimental  results  of  this  approach 
show  that  it  can  outperform  a  hand-crafted  rule 
system  for  the  MUC-6  management  succession 
domain. 

2  Background 

2.1  Information  Extraction 

The  major  task  of  IE  is  to  find  the  elements  of  an 
event  from  text  and  combine  them  to  form 
templates  or  populate  databases.  Most  of  these 
elements  are  named  entities  (NEs)  involved  in  the 
event.  To  determine  which  entities  in  text  are 
involved,  we  need  to  find  reliable  clues  around 
each  entity.  The  extraction  procedure  starts  with 


1 


Report  Documentation  Page 

Form  Approved 

0MB  No.  0704-0188 

Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 

VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  0MB  control  number. 

1.  REPORT  DATE 

2004 

2.  REPORT  TYPE 

3.  DATES  COVERED 

00-00-2004  to  00-00-2004 

4.  TITLE  AND  SUBTITLE 

5a.  CONTRACT  NUMBER 

Discriminative  Slot  Detection  Using  Kernel  Methods 

5b.  GRANT  NUMBER 

5c.  PROGRAM  ELEMENT  NUMBER 

6.  AUTHOR(S) 

5d.  PROJECT  NUMBER 

5e.  TASK  NUMBER 

5f.  WORK  UNIT  NUMBER 

7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Department  of  Computer  Science  ,New  York  University, 715 

Broadway, New  York, NY, 10003 

8.  PERFORMING  ORGANIZATION 

REPORT  NUMBER 

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

10.  SPONSOR/MONITOR’S  ACRONYM(S) 

11.  SPONSOR/MONITOR’S  REPORT 
NUMBER(S) 

12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  unlimited 

13.  SUPPLEMENTARY  NOTES 

14.  ABSTRACT 

15.  SUBJECT  TERMS 

16.  SECURITY  CLASSIFICATION  OF: 

17.  LIMITATION  OF 
ABSTRACT 

18.  NUMBER 

OF  PAGES 

7 

19a.  NAME  OF 
RESPONSIBLE  PERSON 

a.  REPORT 

unclassified 

b.  ABSTRACT 

unclassified 

c.  THIS  PAGE 

unclassified 

Standard  Form  298  (Rev.  8-98} 

Prescribed  by  ANSI  Std  Z39-18 


text  preprocessing,  ranging  from  tokenization  and 
part-of-speech  tagging  to  NE  identification  and 
parsing.  Traditional  approaches  would  use  various 
methods  of  analyzing  the  results  of  deep 
preprocessing  to  find  patterns.  Here  we  propose  to 
use  support  vector  machines  to  identify  clues 
automatically  from  the  outputs  of  different  levels 
of  preprocessing. 

2.2  Support  Vector  Machine 

For  a  two-class  classifier,  with  separable  training 
data,  when  given  a  set  of  n  labeled  vector  examples 

{X„y,UX„y,),...,{X„,yJ,  y,e{+l,-l}, 

a  support  vector  machine  (Vapnik,  1998)  produces 
the  separating  hyperplane  with  largest  margin 
among  all  the  hyperplanes  that  successfully 
classify  the  examples.  Suppose  that  all  the 
examples  satisfy  the  following  constraint: 
}',.X(<1V,Z,.  >+b)>\ 

It  is  easy  to  see  that  the  margin  between  the  two 
bounding  hyperplanes  <W ,X^  >+b  =  +\  is 

2/1 1  Wl  I .  So  maximizing  the  margin  is  equivalent  to 
minimizing  IIWII^  subject  to  the  separation 
constraint  above.  In  machine  learning  theory,  this 
margin  relates  to  the  upper  bound  of  the  VC- 
dimension  of  a  support  vector  machine.  Increasing 
the  margin  reduces  the  VC-dimension  of  the 
learning  system,  thus  increasing  the  generalization 
capability  of  the  system.  So  a  support  vector 
machine  produces  a  classifier  with  optimal 
generalization  capability.  This  property  enables 
SVMs  to  work  in  high  dimensional  vector  spaces. 

2.3  Kernel  SVM 

The  vectors  in  SVM  are  usually  feature  vectors 
extracted  by  a  certain  procedure  from  the  original 
objects,  such  as  images  or  sentences.  Since  the 
only  operator  used  in  SVM  is  the  dot  product 
between  two  vectors,  we  can  replace  this  operator 
by  a  function  (p{S-,S-)  on  the  object  domain.  In 

our  case,  5,  and  5)  are  sentences.  Mathematically 
this  is  still  valid  as  long  as  (p{S-,Sj)  satisfies 

Mercer’s  condition k  Function  (p{S-,Sj)  is  often 

referred  to  as  a  kernel  function  or  just  a  kernel. 
Kernel  functions  provide  a  way  to  compute  the 
similarity  between  two  objects  without 
transforming  them  into  features. 

The  kernel  set  has  the  following  properties: 


1 .  If  K, (x,  y)  and  (x,  y)  are  kernels  on  XxY  , 

«,  y?  >  0 ,  then  oKj  {x,y)  +  (x,  y)  is  a  kernel 

onXxY . 

2.  If  Kj(x, y)  and  y)are  kernels  on  XxY  , 

then  Kj  (x,  y)  x  (x,  y)  is  a  kernel  on  XxY  . 

3.  If  Kj(x,  y)  is  a  kernel  on  XxY  and 
K^(u,v)  is  a  kernel  on  U  xV  ,  then 
K((x,u),  (y,  v))  =  Kj(x,  y)  -I-  K^u,  v)  is  a  kernel 
on  {XxU)x(YxV). 

When  we  have  kernels  representing  information 
from  different  sources,  these  properties  enable  us 
to  incorporate  them  into  one  kernel.  The  general 
kernels  such  as  RBF  or  polynomial  kernels  (Muller 
et  ah,  2001),  which  extend  features  nonlinearly 
into  higher  dimensional  space,  can  also  be  applied 
to  either  the  combination  kernel  or  to  each 
component  kernel  individually. 

2.4  Related  Work 

There  have  been  a  number  of  SVM  applications 
in  NFP  using  particular  levels  of  syntactic 
information.  (Fodhi  et  ah,  2002)  compared  a  word- 
based  string  kernel  and  n-gram  kernels  at  the 
sequence  level  for  a  text  categorization  task.  The 
experimental  results  showed  that  the  n-gram 
kernels  performed  quite  well  for  the  task.  Although 
string  kernels  can  capture  common  word 
subsequences  with  gaps,  its  geometric  penalty 
factor  may  not  be  suitable  for  weighting  the  long 
distance  features.  (Collins  et  ah,  2001)  suggested 
kernels  on  parse  trees  and  other  structures  for 
general  NFP  tasks.  These  kernels  count  small 
subcomponents  multiple  times  so  that  in  practice 
one  has  to  be  careful  to  avoid  overfitting.  This  can 
be  achieved  by  limiting  the  matching  depth  or 
using  a  penalty  factor  to  downweight  large 
components. 

(Zelenko  et  al.,  2003)  devised  a  kernel  on 
shallow  parse  trees  to  detect  relations  between 
named  entities,  such  as  the  person- affiliation 
relation  between  a  person  name  and  an 
organization  name.  The  so-called  relation  kernel 
matches  from  the  roots  of  two  trees  and  continues 
recursively  to  the  leaf  nodes  if  the  types  of  two 
nodes  match. 

All  the  kernels  used  in  these  works  were  applied 
to  a  particular  syntactic  level.  This  paper  presents 
an  approach  for  information  extraction  that  uses 
kernels  to  combine  information  from  different 
levels  and  automatically  identify  which 
information  contributes  to  the  task.  This 
framework  can  also  be  applied  to  other  NFP  tasks. 


*  The  matrix  must  be  positive  semi-definite 
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3  A  Discriminative  Framework 

The  discriminative  framework  proposed  here  is 
called  ARES  (Automated  Recognition  of  Event 
Slots).  It  makes  no  assumption  about  the  text 
structure  of  events.  Instead,  kernels  are  used  to 
represent  syntactic  information  from  various 
syntactic  sources.  The  structure  of  ARES  is  shown 
in  Eig  1.  The  preprocessing  modules  include  a 
part-of-speech  tagger,  name  tagger,  sentence  parser 
and  GLARE  parser,  hut  are  not  limited  to  these. 
Other  general  tools  can  also  he  included,  which  are 
not  shown  in  the  diagram.  The  triangles  in  the 
diagram  are  kernels  that  encode  the  corresponding 
syntactic  processing  result.  In  the  training  phase, 
the  target  slot  fillers  are  labeled  in  the  text  so  that 
SVM  slot  detectors  can  be  trained  through  the 
kernels  to  find  fillers  for  the  key  slots  of  events.  In 
the  testing  phase,  the  SVM  classifier  will  predict 
the  slot  fillers  from  unlabeled  text  and  a  merging 
procedure  will  merge  slots  into  events  if  necessary. 
The  main  kernel  we  propose  to  use  is  on  GLARE 
(Meyers  et  ah,  2001)  dependency  graphs. 


Eig  1 .  Structure  of  the  discriminative  model 

The  idea  is  that  an  IE  model  should  not  commit 
itself  to  any  syntactic  level.  The  low  level 
information,  such  as  word  collocations,  may  also 
give  us  important  clues.  Our  experimentation  will 
show  that  for  the  MUC-6  management  succession 
domain,  even  bag-of-words  or  n-grams  can  give  us 
helpful  information  about  event  occurrence. 

3.1  Syntactic  Kernels 

To  make  use  of  syntactic  information  from 
different  levels,  we  can  develop  kernel  functions  or 
syntactic  kernels  to  represent  a  certain  level  of 
syntactic  structure.  The  possible  syntactic  kernels 
include 

•  Sequence  kernels:  representing  sequence 
level  information,  such  as  bag-of-words,  n- 
grams,  string  kernel,  etc. 

•  Phrase  kernel:  representing  information  at 
an  intermediate  level,  such  as  kernels 
based  on  multiword  expressions,  chunks  or 
shallow  parse  trees. 


•  Parsing  kernel:  representing  detailed 
syntactic  structure  of  a  sentence,  such  as 
kernels  based  on  parse  trees  or  dependency 
graphs. 

These  kernels  can  be  used  alone  or  combined 
with  each  other  using  the  properties  of  kernels. 
They  can  also  be  combined  with  high-order  kernels 
like  polynomial  or  RBE  kernels,  either  individually 
or  on  the  resulting  kernel. 

As  the  depth  of  analysis  of  the  preprocessing 
increases,  the  accuracy  of  the  result  decreases. 
Combining  the  results  of  deeper  processing  with 
those  of  shallower  processing  (such  as  n-grams) 
can  also  give  us  a  back-off  ability  to  recover  from 
errors  in  deep  processing. 

In  practice  each  kernel  can  be  tested  for  the  task 
as  the  sole  input  to  an  SVM  to  determine  if  this 
level  of  information  is  helpful  or  not.  After 
figuring  out  all  the  useful  kernels,  we  can  try  to 
combine  them  to  make  a  comprehensive  kernel  as 
final  input  to  the  classifier.  The  way  to  combine 
them  and  the  parameters  in  combination  can  be 
determined  using  validation  data. 

4  Introduction  to  GLARF 

GLARE  (Grammatical  and  Logical  Argument 
Regularization  Eramework)  [Meyers  et  ah,  2001]  is 
a  hand-coded  system  that  produces  comprehensive 
word  dependency  graphs  from  Penn  TreeBank-II 
(PTB-II)  parse  trees  to  facilitate  applications  like 
information  extraction.  GLARE  is  designed  to 
enhance  PTB-II  parsing  to  produce  more  detailed 
information  not  provided  by  parsing,  such  as 
information  about  object,  indirect  object  and 
appositive  relations.  GLARE  can  capture  more 
regularization  in  text  by  transforming  non- 
canonical  (passive,  filler-gap)  constructions  into 
their  canonical  forms  (simple  declarative  clauses). 
This  is  very  helpful  for  information  extraction 
where  training  data  is  often  sparse.  It  also 
represents  all  syntactic  phenomena  in  uniform 
typed  PRED-ARG  structures,  which  is  convenient 
for  computational  purposes.  Eor  a  sentence, 
GLARE  outputs  depencency  triples  derived 
automatically  from  the  GLARE  typed  feature 
structures  [Meyers  et  ah,  2001].  A  directed 
dependency  graph  of  the  sentence  can  also  be 
constructed  from  the  depencency  triples.  The 
following  is  the  output  of  GLARE  for  the  sentence 
“Tom  Donilon,  who  also  could  get  a  senior  job 

<SBJ,  get,  Tom  Donilon> 

<OBJ,  get,  job> 

<ADV,  get,  also> 

<AUX,  get,  could> 

<T-POS,  job,  a> 
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<A-POS,  job,  senior> 

GLARF  can  produce  logical  relations  in  addition 
to  surface  relations,  which  is  helpful  for  IE  tasks.  It 
can  also  generate  output  containing  the  base  form 
of  words  so  that  different  tenses  of  verbs  can  be 
regularized.  Because  of  all  these  features,  our  main 
kernels  are  based  on  the  GLARF  dependency 
triples  or  dependency  graphs. 

5  Event  and  Slot  Kernels 

Here  we  will  introduce  the  kernels  used  by  ARES 
for  event  occurrence  detection  (EOD)  and  slot 
filler  detection  (SFD). 

5.1  EOD  Kernels 

In  Information  Extraction,  one  interesting  issue 
is  event  occurrence  detection,  which  is  determining 
whether  a  sentence  contains  an  event  occurrence  or 
not.  If  this  information  is  given,  it  would  be  much 
easier  to  find  the  relevant  entities  for  an  event  from 
the  current  sentence  or  surrounding  sentences. 
Traditional  approaches  do  matching  (for  slot 
filling)  on  all  sentences,  even  though  most  of  them 
do  not  contain  any  event  at  all.  Event  occurrence 
detection  is  similar  to  sentence  level  information 
retrieval,  so  simple  models  like  bag-of-words  or  n- 
grams  could  work  well.  We  tried  two  kernels  to  do 
this,  one  is  a  sequence  level  n-gram  kernel  and  the 
other  is  a  GLARF-based  kernel  that  matches 
syntactic  details  between  sentences.  In  the 
following  formulae,  we  will  use  an  identity 
function  I{x,  y)  that  gives  1  when  x  =  y  and  0 
otherwise,  where  x  and  y  are  strings  or  vectors  of 
strings. 

1.  N-gram  kernel  that  counts  common 

n-grams  between  two  sentences.  Given  two 
sentence:  5;  =<WpW2,..J%^>,  and  S^=<w^,W2,■■Wf^>, 
a  bigram  kernel  (p^^{S^,S2)  is 

]V|-1  X,-! 
i=l  ;=1 

Kernels  can  be  inclusive,  in  other  words,  the 
trigram  kernel  includes  bigrams  and  unigrams.  For 
the  unigram  kernel  a  stop  list  is  used  that  removes 
words  other  than  nouns,  verbs,  adjectives  and 
adverbs. 

2.  Glarf  kernel  (p^  (Gj ,  G2 ) :  this  kernel  is  based 

on  the  GLARF  dependency  result.  Given  the  triple 
outputs  of  two  sentences  produced  by 
GLARF:  Gj  ={<r.,  p. ,a->}  ,  1  <  /  <  N^  and 


<^2  /  j-^2’  where  r„  a, 

correspond  to  the  role  label,  predicate  word  and 
argument  word  respectively  in  GLARF  output,  it 
matches  the  two  triples,  their  predicates  and 
arguments  respectively.  So  (p^  (Gj ,  G2 )  equals 

ZZ(/«,,«,,>,<rpPpa,  >)+aI{p,p)+(5l{a„a)) 

i=l  7=1 

In  our  experiments,  a  and  P  were  set  to  I. 

5.2  SFD  Kernels 

Slot  filler  detection  (SFD)  is  the  task  of 
determining  which  named  entities  fill  a  slot  in 
some  event  template.  Two  kernels  were  proposed 
for  SFD:  the  first  one  matches  local  contexts  of 
two  target  NEs,  while  the  second  one  combines  the 
first  one  with  an  n-gram  EOD  kernel. 

1.  (p^ sfd{G^,G  ■) :  This  kernel  was  also  defined 

on  a  GLARF  dependency  graph  (DG),  a  directed 
graph  constructed  from  its  typed  PRED-ARG 
outputs.  The  arcs  labeled  with  roles  go  from 
predicate  words  to  argument  words.  This  kernel 
matches  local  context  surrounding  a  name  in  a 
GLARF  dependency  graph.  In  preprocessing,  all 
the  names  of  the  same  type  are  translated  into  one 
symbol  (a  special  word).  The  matching  starts  from 
two  anchor  nodes  (NE  nodes  of  the  same  type)  in 
the  two  DG’s  and  recursively  goes  from  these 
nodes  to  their  successors  and  predecessors,  until 
the  words  associated  with  nodes  do  not  match.  In 
our  experiment,  the  matching  depth  was  set  to  2. 
Each  node  n  contains  a  predicate  word  w  and 
relation  pairs  {<  r^,a^  >} ,  \  <i<  p  representing 
its  p  arguments  and  the  roles  associated  with  them. 
A  matching  function  C(nj,n2)  is  defined  as 

Z  >)  +  Kr„rp). 

;=1  ;=1 

Then  cp^  SFD  {Gj ,  G j ) :  can  be  written  as 
C(Ej,Ep+  ZCK,^) 

n^e  Succ(Ej )  n^ePred  (E^ ) 

rijSSucc  (Ej)  njSPred  (E j  ) 

n I =n  j  n I =n  j 

where  £,  and  £,  are  the  anchor  nodes  in  the  two 
DG’ s;  Uj  =  tij  is  true  if  the  predicate  words 

associated  with  them  match.  Functions  Succ(n)  and 
Pred(n)  give  the  successor  and  predecessor  node 
set  of  a  node  n.  The  reason  for  setting  a  depth  limit 
is  that  it  covers  most  of  the  local  syntax  of  a  node 
(before  matching  stops);  another  reason  is  that  the 
cycles  currently  present  in  GLARF  dependency 
graph  prohibit  unbounded  recursive  matching. 

2.  (p^sFD {Sj ,S p.  This  kernel  combines  linearly 
the  n-gram  event  kernel  and  the  slot  kernel  above. 
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in  the  hope  that  the  general  event  occurrence 
information  provided  hy  EOD  kernel  can  help  the 
slot  kernel  to  ignore  NEs  in  sentences  that  do  not 
contain  any  event  occurrence. 

^^sFDiSi,S j)  -  j)  +  , 

where  a,  P  were  set  to  he  1  in  our  experiments. 
The  Glarf  event  kernel  was  not  used,  simply 
because  it  uses  information  from  the  same  source 
as  (p^ SFD{G^,G j) .  The  n-gram  kernel  was  chosen 

to  he  the  trigram  kernel,  which  gives  us  the  best 
EOD  performance  among  n-gram  kernels. 

We  also  tried  the  dependency  graph  kernel 
proposed  by  (Collins  et  ah,  2001),  but  it  did  not 
give  us  better  result. 

6  Experiments 

6.1  Corpus 

The  experiments  of  ARES  were  done  on  the 
MUC-6  corporate  management  succession  domain 
using  the  official  training  data  and,  for  the  final 
experiment,  the  official  test  data  as  well.  The 
training  data  was  split  into  a  training  set  (80%)  and 
validation  set  (20%).  In  ARES,  the  text  was 
preprocessed  by  the  Proteus  NE  tagger  and 
Charniak  sentence  parser.  Then  the  GLARE 
processor  produced  dependency  graphs  based  on 
the  parse  trees  and  NE  results.  All  the  names  were 
transformed  into  symbols  representing  their  types, 
such  as  #PERSON#  for  all  person  names.  The 
reason  is  that  we  think  the  name  itself  does  not 
provide  a  significant  clue;  the  only  thing  that 
matters  is  what  type  of  name  occurs  at  certain 
position. 

Two  tasks  have  been  tried:  one  is  EOD  (event 
occurrence  detection)  on  sentences;  the  other  is 
SED  (slot  filler  detection)  on  named  entities, 
including  person  names  and  job  titles.  EOD  is  to 
determine  whether  a  sentence  contains  an  event  or 
not.  This  would  give  us  general  information  about 
sentence-level  event  occurrences.  SED  is  to  find 
name  fillers  for  event  slots.  The  slots  we 
experimented  with  were  the  person  name  and  job 
title  slots  in  MUC-6.  We  used  the  SVM  package 
svm'*®’’'  in  our  experiments,  embedding  our  own 
kernels  as  custom  kernels. 

6.2  EOD  Experiments 

In  this  experiment,  ARES  was  trained  on  the 
official  MUC-6  training  data  to  do  event 
occurrence  detection.  The  data  contains  1940 
sentences,  of  which  158  are  labeled  as  positive 
instances  (contain  an  event).  Eive-fold  cross 
validation  was  used  so  that  the  training  and  test  set 
contain  80%  and  20%  of  the  data  respectively. 


Three  kernels  defined  in  the  previous  section  were 
tried.  Table  1  shows  the  performance  of  each 
kernel.  Three  n-gram  kernels  were  tested:  unigram, 
bigram  and  trigram.  Subsequences  longer  than 
trigrams  were  also  tried,  but  did  not  yield  better 
results. 

The  results  show  that  the  trigram  kernel 
performed  the  best  among  n-gram  kernels.  GLARE 
kernel  did  better  than  n-gram  kernels,  which  is 
reasonable  because  it  incorporates  detailed  syntax 
of  a  sentence.  But  generally  speaking,  the  n-gram 
kernels  alone  performed  fairly  well  for  this  task, 
which  indicates  that  low  level  text  processing  can 
also  provide  useful  information.  The  mix  kernel 
that  combines  the  trigram  kernel  with  GLARE 
kernel  gave  the  best  performance,  which  might 
indicate  that  the  low  level  information  provides 
additional  clues  or  helps  to  overcome  errors  in 
deep  processing. 


Kernel 

Precision 

Recall 

F-score 

Unigram 

66.0% 

66.5% 

66.3% 

Bigram 

73.9% 

60.3% 

66.4% 

Trigram 

77.5% 

61.5% 

68.6% 

GLARF 

77.5% 

63.9% 

70.1% 

Mix 

81.5% 

66.4% 

73.2% 

Table  1.  EOD  performance  of  ARES  using 
different  kernels.  The  Mix  kernel  is  a  linear 
combination  of  the  trigram  kernel  and  the  Glarf 
kernel. 

6.3  SED  Experiments 

The  slot  filler  detection  (SED)  task  is  to  find  the 
named  entities  in  text  that  can  fill  the 
corresponding  slots  of  an  event. ^  We  treat  job  title 
as  a  named  entity  throughout  this  paper,  although  it 
is  not  included  in  the  traditional  MUG  named 
entity  set.  The  slots  we  used  for  evaluation  were 
PERSON_IN  (the  person  who  took  a  position), 
PERSON_OUT  (the  person  who  left  a  position) 
and  POST  (the  position  involved).  We  generated 
the  two  person  slots  from  the  official  MUC-6 
templates  and  the  corresponding  filler  strings  in 
text  were  labeled.  Three  SVM  predictors  were 
trained  to  find  name  fillers  of  each  slot.  Two 
experiments  have  been  tried  on  MUC-6  training 
data  using  five-fold  cross  validation. 

The  first  experiment  of  ARES  used  slot  kernel 
^*5fd(G,, Gy)  alone,  relying  solely  on  local 


^  We  used  this  task  for  evaluation,  rather  than  the 
official  MUC  template-filling  task,  in  order  to  assess  the 
system’ s  ability  to  identify  slot  fillers  separately  from  its 
ability  to  combine  them  into  templates. 
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context  around  a  NE.  From  the  performance  table 
(Table  2),  we  can  see  that  local  context  can  give  a 
fairly  good  clue  for  finding  PERSON_IN  and 
POST,  but  not  for  PERSON_OUT.  The  main 
reason  is  that  local  context  might  be  not  enough  to 
determine  a  PERSON_OUT  filler.  It  often  requires 
inference  or  other  semantic  information.  For 
example,  the  sentence  “Aaron  Spelling,  the 
company's  vice  president,  was  named  president.”, 
indicates  that  “Aaron  Spelling”  left  the  position  of 
vice  president,  therefore  it  should  be  a 
PERSON_OUT.  But  the  sentence  “Aaron  Spelling, 
the  company's  vice  president,  said  ...”,  which  is 
very  similar  to  first  one  in  syntax,  has  no  such 
indication  at  all.  In  complicated  cases,  a  person  can 
even  hold  two  positions  at  the  same  time. 


1  Accuracy 

Precision 

Recall 

E- score  I 

PER_IN 

63.6% 

62.5% 

63.1% 

PER„OUT 

54.8% 

54.2% 

54.5% 

POST 

64.4% 

55.2% 

59.4% 

Table  2.  SFD  performance  of  ARES  using  kernel 

(P^SFd{Gi,Gj)  ■ 


In  this  experiment,  the  SVM  predictor 
considered  all  the  names  identified  by  the  NE 
tagger;  however,  most  of  the  sentences  do  not 
contain  an  event  occurrence  at  all,  so  NEs  in  these 
sentences  should  be  ignored  no  matter  what  their 
local  context  is.  To  achieve  this  we  need  general 
information  about  event  occurrence,  and  this  is  just 
what  the  EOD  kernel  can  provide.  In  our  second 
experiment,  we  tested  the  kernel  (P^sfd{S.,S j)  , 
which  is  a  linear  combination  of  the  trigram  EOD 
kernel  and  the  SFD  kernel  )  .  Table  3 

shows  the  performance  of  the  combination  kernel, 
from  which  we  can  see  that  there  is  clear 
performance  improvement  for  all  three  slots.  We 
also  tried  to  use  the  mix  kernel  which  gave  us  the 
best  EOD  performance,  but  it  did  not  yield  a  better 
result.  The  reason  we  think  is  that  the  GLARE 
EOD  kernel  and  SFD  kernel  are  from  the  same 
syntactic  source,  so  the  information  was  repeated. 


Accuracy 

Precision 

PERJN 

86.6% 

60.5% 

71.2% 

PER„OUT 

69.2% 

58.2% 

63.2% 

POST 

68.5% 

68.9% 

68.7% 

Table  3.  SFD  performance  of  ARES  using  kernel 
^^sfd{S.,S j).  It  combines  the  Glarf  SFD  kernel 

with  trigram  EOD  kernel.  For  PER_OUT, 
uni  gram  EOD  kernel  was  used. 


Since  five-fold  cross  validation  was  used,  ARES 
was  trained  on  80%  of  the  MUC-6  training  data  in 
these  two  experiments. 

6.4  Comparison  with  MUC-6  System 

This  experiment  was  done  on  the  official  MUC- 
6  training  and  test  data,  which  contain  50K  words 
and  40K  words  respectively.  ARES  used  the 
official  corpora  as  training  and  test  sets,  except  that 
in  the  training  data,  all  the  slot  fillers  were 
manually  labeled.  We  compared  the  performance 
of  ARES  with  the  NYU  Proteus  system,  a  rule- 
based  system  that  performed  well  for  MUC-6.  To 
score  the  performance  for  these  three  slots,  we 
generated  the  slot-filler  pairs  as  keys  for  a 
document  from  the  official  MUC-6  templates  and 
removed  duplicate  pairs.  The  scorer  matches  the 
filler  string  in  the  response  file  of  ARES  to  the 
keys.  The  response  result  for  Proteus  was 
extracted  in  the  same  way  from  its  template  output. 
Table  4.  shows  the  result  of  ARES  using  the 
combination  kernel  in  the  previous  experiment. 


1  Accuracy  I 

1  Precision  I 

PERJN 

77.3% 

62.2% 

68.9% 

PER_OUT 

58.9% 

69.7% 

63.9% 

POST 

77.1% 

71.5% 

73.6% 

Table  4.  Slot  performance  ARES  using  kernel 
(P^sfd{S.,S j)  on  MUC-6  test  data. 


Table  5  shows  the  test  result  of  the  Proteus 
system.  Comparing  the  numbers  we  can  see  that 
for  slot  PERSONJN  and  POST,  ARES 
outperformed  the  Proteus  system  by  a  few  points. 
The  result  is  promising  considering  that  this  model 
is  fully  automatic  and  does  not  involve  any  post¬ 
processing.  As  for  the  PERSON_OUT  slot,  the 
performance  of  ARES  was  not  as  good.  As  we 
have  discussed  before,  relying  purely  on  syntax 
might  not  help  us  much;  we  may  need  an 
inference  model  to  resolve  this  problem. 


1  Accuracy  I 

1  Precision  I 

PERJN 

85.7% 

51.2% 

64.1% 

PER_OUT 

78.4% 

58.6% 

67.1% 

POST 

83.3% 

59.7% 

69.5% 

Table  5.  Slot  performance  of  the  rule-based 
Proteus  system  for  MUC-6. 

7  Related  Work 
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(Chieu  et  ah,  2003)  reported  a  feature-based 
SVM  system  (ALICE)  to  extract  MUC-4  events  of 


terrorist  attacks.  The  Alice-ME  system 
demonstrated  competitive  performance  with  rule- 
hased  systems.  The  features  used  hy  Alice  are 
mainly  from  parsing.  Comparing  with  ALICE,  our 
system  uses  kernels  on  dependency  graphs  to 
replace  explicit  features,  an  approach  which  is 
fully  automatic  and  requires  no  enumeration  of 
features.  The  model  we  proposed  can  combine 
information  from  different  syntactic  levels  in 
principled  ways.  In  our  experiments,  we  used  both 
word  sequence  information  and  parsing  level 
syntax  information.  The  training  data  for  ALICE 
contains  1700  documents,  while  for  our  system  it 
is  just  100  documents.  When  data  is  sparse,  it  is 
more  difficult  for  an  automatic  system  to 
outperform  a  rule-based  system  that  incorporates 
general  knowledges. 

8  Discussion  and  Further  Works 

This  paper  describes  a  discriminative  approach 
that  can  use  syntactic  clues  automatically  for  slot 
filler  detection.  It  outperformed  a  hand-crafted 
system  on  sparse  data  by  considering  different 
levels  of  syntactic  clues.  The  result  also  shows  that 
low  level  syntactic  information  can  also  come  into 
play  in  finding  events,  thus  it  should  not  be  ignored 
in  the  IE  framework. 

Eor  slot  filler  detection,  several  classifiers  were 
trained  to  find  names  for  each  slot  and  there  is  no 
correlation  among  these  classifiers.  However, 
entity  slots  in  events  are  often  strongly  correlated, 
for  example  the  PERJN  and  POST  slots  for 
management  succession  events.  Since  these 
classifiers  take  the  same  input  and  produce 
different  results,  correlation  models  can  be  used  to 
integrate  these  classifiers  so  that  the  identification 
of  slot  fillers  might  benefit  each  other. 

It  would  also  be  interesting  to  experiment  with 
the  tasks  that  are  more  difficult  for  pattern 
matching,  such  as  determining  the  on-the-job 
status  property  in  MUC-6.  Since  events  often  span 
multiple  sentences,  another  direction  is  to  explore 
cross-sentence  models,  which  is  difficult  for 
traditional  approaches.  Eor  our  approach  it  is 
possible  to  extend  the  kernel  from  one  sentence  to 
multiple  sentences,  taking  into  account  the 
correlation  between  NE’s  in  adjacent  sentences. 
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